Goal:
The objective of this project is to better understand the types of customers who purchase chips and their purchasing behaviour within the region, i.e. their purchasing trends and habits. Additionally, we aim to segment customers and provide strategic recommendations to boost sales.
The datasets are provided by Quantium as part of the Forage data analytics simulation. It includes transaction and customer data from a retail store, capturing information about product sales, dates, and customer interactions. The datasets are structured to help analyze sales trends and customer behavior.
Key features include transaction records (dates, stores, products, quantities, sales amounts) and customer attributes (lifestage and premium status), explored in detail below.
%%capture
%pip install openpyxl  # engine pandas needs to read .xlsx files
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import math
fd = pd.read_excel('C:/Users/obalabi adepoju/Downloads/Quantium/transaction_data.xlsx')  # transaction data
df = pd.read_csv('C:/Users/obalabi adepoju/Downloads/Quantium/purchase_behaviour.csv')   # purchase behaviour data
We'll start with the purchase behaviour dataset.
Let's check out the first 10 records of the dataset.
df.head(10)
| | LYLTY_CARD_NBR | LIFESTAGE | PREMIUM_CUSTOMER |
|---|---|---|---|
| 0 | 1000 | YOUNG SINGLES/COUPLES | Premium |
| 1 | 1002 | YOUNG SINGLES/COUPLES | Mainstream |
| 2 | 1003 | YOUNG FAMILIES | Budget |
| 3 | 1004 | OLDER SINGLES/COUPLES | Mainstream |
| 4 | 1005 | MIDAGE SINGLES/COUPLES | Mainstream |
| 5 | 1007 | YOUNG SINGLES/COUPLES | Budget |
| 6 | 1009 | NEW FAMILIES | Premium |
| 7 | 1010 | YOUNG SINGLES/COUPLES | Mainstream |
| 8 | 1011 | OLDER SINGLES/COUPLES | Mainstream |
| 9 | 1012 | OLDER FAMILIES | Mainstream |
LYLTY_CARD_NBR : This column represents the loyalty card number of each customer.
LIFESTAGE : This column categorizes customers based on their life stage, indicating the demographic group they belong to.
PREMIUM_CUSTOMER : This column indicates the premium status of the customer, describing their spending behavior.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   LYLTY_CARD_NBR    72637 non-null  int64
 1   LIFESTAGE         72637 non-null  object
 2   PREMIUM_CUSTOMER  72637 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
print(f"This dataset has {df.shape[0]} rows, {df.shape[1]} columns and it has no null values.")
This dataset has 72637 rows, 3 columns and it has no null values.
Let's check for duplicate values in the loyalty card column, as this serves as our primary ID.
print(f"Duplicate Values: {df['LYLTY_CARD_NBR'].duplicated().any()}")
Duplicate Values: False
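As a quick illustration of what `.duplicated().any()` reports, here is a minimal sketch on a hypothetical toy frame (the card numbers below are made up); `keep=False` flags every member of a duplicated group, which is handy when you want to inspect the offending rows rather than just count them:

```python
import pandas as pd

# Hypothetical toy frame: card 1002 appears twice.
toy = pd.DataFrame({"LYLTY_CARD_NBR": [1000, 1002, 1002, 1003]})

# .duplicated().any() answers "are there any repeats?" in one pass.
print(toy["LYLTY_CARD_NBR"].duplicated().any())  # True

# keep=False marks all copies of a duplicated value, so the
# offending records can be examined directly.
dupes = toy[toy["LYLTY_CARD_NBR"].duplicated(keep=False)]
print(dupes)
```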
Let's examine our LIFESTAGE column.
lifestage_counts=df['LIFESTAGE'].value_counts()
lifestage_counts
LIFESTAGE
RETIREES                  14805
OLDER SINGLES/COUPLES     14609
YOUNG SINGLES/COUPLES     14441
OLDER FAMILIES             9780
YOUNG FAMILIES             9178
MIDAGE SINGLES/COUPLES     7275
NEW FAMILIES               2549
Name: count, dtype: int64
Let's visualize the distribution of each value.
%matplotlib inline
# Create the histogram (use a separate name so we don't shadow matplotlib's plt alias)
fig = px.histogram(df, x='LIFESTAGE', title='Customer Distribution by Lifestage Category', color='LIFESTAGE')
# Display the chart
fig.show()
Let's look at the PREMIUM_CUSTOMER column.
df['PREMIUM_CUSTOMER'].value_counts()
PREMIUM_CUSTOMER
Mainstream    29245
Budget        24470
Premium       18922
Name: count, dtype: int64
# Donut chart of customer segments (again avoiding the plt name)
fig = px.pie(df, names='PREMIUM_CUSTOMER', title='Customer Distribution', hole=0.4)
fig.show()
For a better understanding, here is what each segment broadly represents: Budget customers are price-conscious shoppers, Mainstream customers sit in the middle of the spending range, and Premium customers are willing to pay more per purchase.
Now let us dive into the transaction dataset.
fd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   DATE            264836 non-null  int64
 1   STORE_NBR       264836 non-null  int64
 2   LYLTY_CARD_NBR  264836 non-null  int64
 3   TXN_ID          264836 non-null  int64
 4   PROD_NBR        264836 non-null  int64
 5   PROD_NAME       264836 non-null  object
 6   PROD_QTY        264836 non-null  int64
 7   TOT_SALES       264836 non-null  float64
dtypes: float64(1), int64(6), object(1)
memory usage: 16.2+ MB
print(f"This dataset has {fd.shape[0]} rows, {fd.shape[1]} columns and it has no null values.")
This dataset has 264836 rows, 8 columns and it has no null values.
# A preview of the dataset showing the first 10 records
fd.head(10)
| | DATE | STORE_NBR | LYLTY_CARD_NBR | TXN_ID | PROD_NBR | PROD_NAME | PROD_QTY | TOT_SALES |
|---|---|---|---|---|---|---|---|---|
| 0 | 43390 | 1 | 1000 | 1 | 5 | Natural Chip Compny SeaSalt175g | 2 | 6.0 |
| 1 | 43599 | 1 | 1307 | 348 | 66 | CCs Nacho Cheese 175g | 3 | 6.3 |
| 2 | 43605 | 1 | 1343 | 383 | 61 | Smiths Crinkle Cut Chips Chicken 170g | 2 | 2.9 |
| 3 | 43329 | 2 | 2373 | 974 | 69 | Smiths Chip Thinly S/Cream&Onion 175g | 5 | 15.0 |
| 4 | 43330 | 2 | 2426 | 1038 | 108 | Kettle Tortilla ChpsHny&Jlpno Chili 150g | 3 | 13.8 |
| 5 | 43604 | 4 | 4074 | 2982 | 57 | Old El Paso Salsa Dip Tomato Mild 300g | 1 | 5.1 |
| 6 | 43601 | 4 | 4149 | 3333 | 16 | Smiths Crinkle Chips Salt & Vinegar 330g | 1 | 5.7 |
| 7 | 43601 | 4 | 4196 | 3539 | 24 | Grain Waves Sweet Chilli 210g | 1 | 3.6 |
| 8 | 43332 | 5 | 5026 | 4525 | 42 | Doritos Corn Chip Mexican Jalapeno 150g | 1 | 3.9 |
| 9 | 43330 | 7 | 7150 | 6900 | 52 | Grain Waves Sour Cream&Chives 210G | 2 | 7.2 |
Let's take a look at the product name column.
fd['PROD_NAME'].value_counts()
PROD_NAME
Kettle Mozzarella Basil & Pesto 175g 3304
Kettle Tortilla ChpsHny&Jlpno Chili 150g 3296
Cobs Popd Swt/Chlli &Sr/Cream Chips 110g 3269
Tyrrells Crisps Ched & Chives 165g 3268
Cobs Popd Sea Salt Chips 110g 3265
...
RRD Pc Sea Salt 165g 1431
Woolworths Medium Salsa 300g 1430
NCC Sour Cream & Garden Chives 175g 1419
French Fries Potato Chips 175g 1418
WW Crinkle Cut Original 175g 1410
Name: count, Length: 114, dtype: int64
print(f"This dataset contains {fd['PROD_NAME'].nunique()} distinct chip names")
This dataset contains 114 distinct chip names
Now we check for duplicates in the loyalty card column
print(f"Duplicate Values: {fd['LYLTY_CARD_NBR'].duplicated().any()}")
Duplicate Values: True
# Removing duplicate loyalty card numbers, keeping each customer's first transaction
fd = fd.drop_duplicates(subset=['LYLTY_CARD_NBR'], keep='first')
# Check if it has been removed
print(f"Duplicate Values: {fd['LYLTY_CARD_NBR'].duplicated().any()}")
Duplicate Values: False
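The `keep='first'` semantics can be illustrated on a hypothetical customer with two transactions; note that only the first transaction per card number survives:

```python
import pandas as pd

# Hypothetical toy frame: card 1000 has two transactions.
toy = pd.DataFrame({"LYLTY_CARD_NBR": [1000, 1000, 1002],
                    "TXN_ID": [1, 2, 3]})

# keep='first' retains each customer's first transaction only.
deduped = toy.drop_duplicates(subset=["LYLTY_CARD_NBR"], keep="first")
print(deduped["TXN_ID"].tolist())  # [1, 3]
```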
fd.info()
<class 'pandas.core.frame.DataFrame'>
Index: 72637 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   DATE            72637 non-null  int64
 1   STORE_NBR       72637 non-null  int64
 2   LYLTY_CARD_NBR  72637 non-null  int64
 3   TXN_ID          72637 non-null  int64
 4   PROD_NBR        72637 non-null  int64
 5   PROD_NAME       72637 non-null  object
 6   PROD_QTY        72637 non-null  int64
 7   TOT_SALES       72637 non-null  float64
dtypes: float64(1), int64(6), object(1)
memory usage: 5.0+ MB
print(f"This dataset has {fd.shape[0]} rows and {fd.shape[1]} columns and it has no null values.")
This dataset has 72637 rows and 8 columns and it has no null values.
The transaction data now has the same number of records as the purchase behaviour data.
It's time to combine the two datasets.
data=pd.merge(df, fd, on='LYLTY_CARD_NBR', how='inner')
data.head(10)
| | LYLTY_CARD_NBR | LIFESTAGE | PREMIUM_CUSTOMER | DATE | STORE_NBR | TXN_ID | PROD_NBR | PROD_NAME | PROD_QTY | TOT_SALES |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | YOUNG SINGLES/COUPLES | Premium | 43390 | 1 | 1 | 5 | Natural Chip Compny SeaSalt175g | 2 | 6.0 |
| 1 | 1002 | YOUNG SINGLES/COUPLES | Mainstream | 43359 | 1 | 2 | 58 | Red Rock Deli Chikn&Garlic Aioli 150g | 1 | 2.7 |
| 2 | 1003 | YOUNG FAMILIES | Budget | 43531 | 1 | 3 | 52 | Grain Waves Sour Cream&Chives 210G | 1 | 3.6 |
| 3 | 1004 | OLDER SINGLES/COUPLES | Mainstream | 43406 | 1 | 5 | 96 | WW Original Stacked Chips 160g | 1 | 1.9 |
| 4 | 1005 | MIDAGE SINGLES/COUPLES | Mainstream | 43462 | 1 | 6 | 86 | Cheetos Puffs 165g | 1 | 2.8 |
| 5 | 1007 | YOUNG SINGLES/COUPLES | Budget | 43438 | 1 | 7 | 49 | Infuzions SourCream&Herbs Veg Strws 110g | 1 | 3.8 |
| 6 | 1009 | NEW FAMILIES | Premium | 43424 | 1 | 9 | 20 | Doritos Cheese Supreme 330g | 1 | 5.7 |
| 7 | 1010 | YOUNG SINGLES/COUPLES | Mainstream | 43352 | 1 | 10 | 51 | Doritos Mexicana 170g | 2 | 8.8 |
| 8 | 1011 | OLDER SINGLES/COUPLES | Mainstream | 43310 | 1 | 12 | 84 | GrnWves Plus Btroot & Chilli Jam 180g | 2 | 6.2 |
| 9 | 1012 | OLDER FAMILIES | Mainstream | 43539 | 1 | 16 | 20 | Doritos Cheese Supreme 330g | 1 | 5.7 |
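A minimal sketch of the merge on hypothetical mini tables. The `validate='one_to_one'` argument (an optional extra, not used above) makes pandas raise if either side unexpectedly repeats a key, which is exactly what the duplicate checks were guarding against:

```python
import pandas as pd

# Hypothetical mini versions of the two tables, keyed on LYLTY_CARD_NBR.
behaviour = pd.DataFrame({"LYLTY_CARD_NBR": [1000, 1002],
                          "LIFESTAGE": ["YOUNG SINGLES/COUPLES",
                                        "YOUNG SINGLES/COUPLES"]})
transactions = pd.DataFrame({"LYLTY_CARD_NBR": [1000, 1002],
                             "TOT_SALES": [6.0, 2.7]})

# Inner join on the loyalty card number; validate guards against
# accidental key duplication on either side.
merged = pd.merge(behaviour, transactions, on="LYLTY_CARD_NBR",
                  how="inner", validate="one_to_one")
print(merged.shape)  # (2, 3)
```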
Next, let's specify the questions we'll be answering in our analysis:
- Which chip brands do most customers prefer?
- Is there a relationship between packet sizes and total sales?
- What packet sizes do customers prefer?
- Which customer lifestage and segment contribute the most to total sales?
- Which customer segments are likely to spend above the average total sale?
We'll be looking at the chip brands customers purchased the most and finding out if there is a reason for the preference.
grouped = data.groupby('PROD_NAME').agg(count=('PROD_NAME', 'count'))
grouped = grouped.sort_values('count', ascending=False).reset_index() # Sorting by count
display(grouped)
| | PROD_NAME | count |
|---|---|---|
| 0 | Cobs Popd Swt/Chlli &Sr/Cream Chips 110g | 967 |
| 1 | Twisties Cheese Burger 250g | 949 |
| 2 | Smiths Crnkle Chip Orgnl Big Bag 380g | 947 |
| 3 | Pringles Sweet&Spcy BBQ 134g | 946 |
| 4 | Thins Chips Light& Tangy 175g | 944 |
| ... | ... | ... |
| 109 | Natural ChipCo Hony Soy Chckn175g | 342 |
| 110 | RRD Steak & Chimuchurri 150g | 338 |
| 111 | Red Rock Deli Chikn&Garlic Aioli 150g | 333 |
| 112 | Smiths Chip Thinly CutSalt/Vinegr175g | 333 |
| 113 | WW Crinkle Cut Original 175g | 315 |
114 rows × 2 columns
grouped.describe()
| | count |
|---|---|
| count | 114.000000 |
| mean | 637.166667 |
| std | 268.963067 |
| min | 315.000000 |
| 25% | 372.250000 |
| 50% | 631.000000 |
| 75% | 902.750000 |
| max | 967.000000 |
We'll go into detail about what a few of these statistics imply.
The summary above suggests that the variability in purchase counts makes it hard to single out one preferred choice, or a short list of choices, that applies broadly to the population.
Let's test that theory by looking at the top 50 chip brands with the highest purchase counts.
test=grouped[['PROD_NAME','count']].head(50)
test
| | PROD_NAME | count |
|---|---|---|
| 0 | Cobs Popd Swt/Chlli &Sr/Cream Chips 110g | 967 |
| 1 | Twisties Cheese Burger 250g | 949 |
| 2 | Smiths Crnkle Chip Orgnl Big Bag 380g | 947 |
| 3 | Pringles Sweet&Spcy BBQ 134g | 946 |
| 4 | Thins Chips Light& Tangy 175g | 944 |
| 5 | Doritos Corn Chip Southern Chicken 150g | 944 |
| 6 | Kettle Mozzarella Basil & Pesto 175g | 942 |
| 7 | Tyrrells Crisps Lightly Salted 165g | 941 |
| 8 | Kettle 135g Swt Pot Sea Salt | 939 |
| 9 | Pringles Original Crisps 134g | 933 |
| 10 | Kettle Tortilla ChpsHny&Jlpno Chili 150g | 932 |
| 11 | Tostitos Splash Of Lime 175g | 931 |
| 12 | Doritos Corn Chips Nacho Cheese 170g | 924 |
| 13 | Pringles SourCream Onion 134g | 924 |
| 14 | Kettle Tortilla ChpsBtroot&Ricotta 150g | 923 |
| 15 | Tostitos Smoked Chipotle 175g | 923 |
| 16 | Dorito Corn Chp Supreme 380g | 923 |
| 17 | Cheezels Cheese 330g | 918 |
| 18 | Pringles Barbeque 134g | 918 |
| 19 | Kettle Tortilla ChpsFeta&Garlic 150g | 917 |
| 20 | Kettle Sea Salt And Vinegar 175g | 916 |
| 21 | Infuzions Thai SweetChili PotatoMix 110g | 915 |
| 22 | Kettle Chilli 175g | 912 |
| 23 | Infuzions BBQ Rib Prawn Crackers 110g | 912 |
| 24 | Thins Chips Seasonedchicken 175g | 909 |
| 25 | Pringles Chicken Salt Crips 134g | 907 |
| 26 | Tyrrells Crisps Ched & Chives 165g | 905 |
| 27 | Smiths Crinkle Original 330g | 903 |
| 28 | Cobs Popd Sea Salt Chips 110g | 903 |
| 29 | Smiths Crinkle Chips Salt & Vinegar 330g | 902 |
| 30 | Doritos Corn Chips Cheese Supreme 170g | 901 |
| 31 | Thins Potato Chips Hot & Spicy 175g | 900 |
| 32 | Twisties Chicken270g | 899 |
| 33 | Thins Chips Salt & Vinegar 175g | 896 |
| 34 | Grain Waves Sour Cream&Chives 210G | 896 |
| 35 | Doritos Corn Chip Mexican Jalapeno 150g | 895 |
| 36 | Kettle Honey Soy Chicken 175g | 892 |
| 37 | Kettle Original 175g | 891 |
| 38 | Kettle Sensations Camembert & Fig 150g | 890 |
| 39 | Tostitos Lightly Salted 175g | 889 |
| 40 | Doritos Mexicana 170g | 888 |
| 41 | Infzns Crn Crnchers Tangy Gcamole 110g | 884 |
| 42 | Doritos Cheese Supreme 330g | 883 |
| 43 | Kettle Sensations Siracha Lime 150g | 880 |
| 44 | Old El Paso Salsa Dip Tomato Med 300g | 880 |
| 45 | Grain Waves Sweet Chilli 210g | 880 |
| 46 | Infuzions SourCream&Herbs Veg Strws 110g | 879 |
| 47 | Old El Paso Salsa Dip Chnky Tom Ht300g | 877 |
| 48 | Doritos Corn Chips Original 170g | 875 |
| 49 | Kettle Sensations BBQ&Maple 150g | 875 |
As expected, the differences between the purchase counts of the various chip brands are too small to pinpoint a clear customer preference. To confirm this, we'll conduct a hypothesis test to determine whether there is a statistically significant difference among the brand purchase counts.
test_sample=test.sample(25)# use a subset of our test dataframe as the sample
test_sample
| | PROD_NAME | count |
|---|---|---|
| 48 | Doritos Corn Chips Original 170g | 875 |
| 3 | Pringles Sweet&Spcy BBQ 134g | 946 |
| 29 | Smiths Crinkle Chips Salt & Vinegar 330g | 902 |
| 16 | Dorito Corn Chp Supreme 380g | 923 |
| 15 | Tostitos Smoked Chipotle 175g | 923 |
| 20 | Kettle Sea Salt And Vinegar 175g | 916 |
| 49 | Kettle Sensations BBQ&Maple 150g | 875 |
| 33 | Thins Chips Salt & Vinegar 175g | 896 |
| 31 | Thins Potato Chips Hot & Spicy 175g | 900 |
| 19 | Kettle Tortilla ChpsFeta&Garlic 150g | 917 |
| 38 | Kettle Sensations Camembert & Fig 150g | 890 |
| 46 | Infuzions SourCream&Herbs Veg Strws 110g | 879 |
| 1 | Twisties Cheese Burger 250g | 949 |
| 44 | Old El Paso Salsa Dip Tomato Med 300g | 880 |
| 45 | Grain Waves Sweet Chilli 210g | 880 |
| 5 | Doritos Corn Chip Southern Chicken 150g | 944 |
| 24 | Thins Chips Seasonedchicken 175g | 909 |
| 13 | Pringles SourCream Onion 134g | 924 |
| 32 | Twisties Chicken270g | 899 |
| 17 | Cheezels Cheese 330g | 918 |
| 4 | Thins Chips Light& Tangy 175g | 944 |
| 37 | Kettle Original 175g | 891 |
| 34 | Grain Waves Sour Cream&Chives 210G | 896 |
| 41 | Infzns Crn Crnchers Tangy Gcamole 110g | 884 |
| 35 | Doritos Corn Chip Mexican Jalapeno 150g | 895 |
pop_mean = test['count'].mean()  # Mean of our test population (top 50 brands)
pop_sd = test['count'].std()     # Standard deviation of the test population
samp_mean = test_sample['count'].mean()  # Sample mean
z_value = 1.96  # Two-sided critical value at the 5% significance level
# The standard error uses the sample size (25 here), not the population size
z_score = abs((samp_mean - pop_mean) / (pop_sd / math.sqrt(len(test_sample))))
print(f"Z-Score: {z_score}")
Z-Score: 1.2615871363622293
Conclusion: since the test statistic is less than the critical value (z-score < 1.96), we fail to reject the null hypothesis and conclude there is no significant difference between the brand purchase counts in the population.
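The test above can be wrapped in a small reusable helper. This is a sketch, and the numbers fed in below are hypothetical values in the same ballpark as the notebook's, not actual results:

```python
import math

def z_test(sample_mean, pop_mean, pop_sd, n):
    """Absolute one-sample z statistic (population sd assumed known)."""
    standard_error = pop_sd / math.sqrt(n)
    return abs(sample_mean - pop_mean) / standard_error

# Hypothetical inputs: sample mean 910, population mean 905.5,
# population sd 25, sample size 25.
z = z_test(sample_mean=910.0, pop_mean=905.5, pop_sd=25.0, n=25)
print(round(z, 2))  # 0.9
print(z < 1.96)     # True -> fail to reject H0 at the 5% level
```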
# We'll create a new column, PACK_SIZE(G), extracted from the product name
data['PACK_SIZE(G)'] = data['PROD_NAME'].str.extract(r'(\d+)')  # first run of digits = pack weight
data['PACK_SIZE(G)'] = pd.to_numeric(data['PACK_SIZE(G)'])
sales= data[['LYLTY_CARD_NBR','LIFESTAGE', 'PREMIUM_CUSTOMER','STORE_NBR' ,'TXN_ID', 'PROD_NBR','PROD_NAME','PACK_SIZE(G)','PROD_QTY','TOT_SALES']]
sales.head()
| | LYLTY_CARD_NBR | LIFESTAGE | PREMIUM_CUSTOMER | STORE_NBR | TXN_ID | PROD_NBR | PROD_NAME | PACK_SIZE(G) | PROD_QTY | TOT_SALES |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | YOUNG SINGLES/COUPLES | Premium | 1 | 1 | 5 | Natural Chip Compny SeaSalt175g | 175 | 2 | 6.0 |
| 1 | 1002 | YOUNG SINGLES/COUPLES | Mainstream | 1 | 2 | 58 | Red Rock Deli Chikn&Garlic Aioli 150g | 150 | 1 | 2.7 |
| 2 | 1003 | YOUNG FAMILIES | Budget | 1 | 3 | 52 | Grain Waves Sour Cream&Chives 210G | 210 | 1 | 3.6 |
| 3 | 1004 | OLDER SINGLES/COUPLES | Mainstream | 1 | 5 | 96 | WW Original Stacked Chips 160g | 160 | 1 | 1.9 |
| 4 | 1005 | MIDAGE SINGLES/COUPLES | Mainstream | 1 | 6 | 86 | Cheetos Puffs 165g | 165 | 1 | 2.8 |
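The pack-size extraction can be sanity-checked on a few product names taken from the preview tables above; the regex grabs the first run of digits, which is assumed to be the pack weight in grams (note it also handles names like "Kettle 135g Swt Pot Sea Salt" where the digits appear mid-name):

```python
import pandas as pd

# Product names from the dataset previews above.
names = pd.Series(["Natural Chip Compny SeaSalt175g",
                   "Kettle 135g Swt Pot Sea Salt",
                   "Dorito Corn Chp Supreme 380g"])

# Extract the first run of digits and convert to numbers.
sizes = pd.to_numeric(names.str.extract(r"(\d+)")[0])
print(sizes.tolist())  # [175, 135, 380]
```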
sales['TOT_SALES'].describe()
count    72637.000000
mean         6.976057
std          3.769011
min          1.500000
25%          4.600000
50%          7.400000
75%          8.800000
max        650.000000
Name: TOT_SALES, dtype: float64
In analyzing the total sales column, we discover a highly anomalous value that is completely different from the rest of the data, as most values fall in the range of 4 to 8.
For a better understanding, we'll query the dataset to examine the outliers.
sales[sales['TOT_SALES'] >= 50]
| | LYLTY_CARD_NBR | LIFESTAGE | PREMIUM_CUSTOMER | STORE_NBR | TXN_ID | PROD_NBR | PROD_NAME | PACK_SIZE(G) | PROD_QTY | TOT_SALES |
|---|---|---|---|---|---|---|---|---|---|---|
| 59694 | 226000 | OLDER FAMILIES | Premium | 226 | 226201 | 4 | Dorito Corn Chp Supreme 380g | 380 | 200 | 650.0 |
This anomaly is due to an unusually large product quantity, and there is only one such instance. While it is entirely possible for a single transaction to contain 200 units, we'll treat this entry as a data-entry mistake so it does not skew our analysis.
# We'll replace the values with what we assume to be the correct entry,
# using label-based indexing for clarity
sales.loc[59694, 'PROD_QTY'] = 2
sales.loc[59694, 'TOT_SALES'] = 6.5
sales.loc[[59694]]
| | LYLTY_CARD_NBR | LIFESTAGE | PREMIUM_CUSTOMER | STORE_NBR | TXN_ID | PROD_NBR | PROD_NAME | PACK_SIZE(G) | PROD_QTY | TOT_SALES |
|---|---|---|---|---|---|---|---|---|---|---|
| 59694 | 226000 | OLDER FAMILIES | Premium | 226 | 226201 | 4 | Dorito Corn Chp Supreme 380g | 380 | 2 | 6.5 |
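A more systematic way to flag such values is the classic 1.5×IQR fence, sketched here on a handful of hypothetical sales values that mirror the anomaly found above:

```python
import pandas as pd

# Hypothetical sales values with one extreme entry, mirroring the
# 650.0 transaction found above.
s = pd.Series([6.0, 2.7, 3.6, 1.9, 2.8, 650.0])

# Values beyond the 1.5*IQR fences are flagged as outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [650.0]
```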
# We'll then visualize the relationship between packet sizes and total sales.
sales.plot.scatter(x='TOT_SALES',y='PACK_SIZE(G)',alpha=0.25)
<Axes: xlabel='TOT_SALES', ylabel='PACK_SIZE(G)'>
Looking at the scatter plot, there is no clear linear correlation between packet size and total sales: the bulk of the data stays between 150g and 250g regardless of the sales value.
We'll start by grouping the dataset by packet size and analyzing, for each size, the sum of total sales and the number of occurrences.
g=sales.groupby('PACK_SIZE(G)').agg(count=('PACK_SIZE(G)','count'),sales=('TOT_SALES','sum'))
g = g.sort_values('count', ascending=False).reset_index()  # Sorting in descending order to surface the most popular pack sizes
g
| | PACK_SIZE(G) | count | sales |
|---|---|---|---|
| 0 | 175 | 18132 | 126605.80 |
| 1 | 150 | 11621 | 78375.40 |
| 2 | 134 | 7189 | 48462.60 |
| 3 | 110 | 6331 | 43646.80 |
| 4 | 170 | 5415 | 37756.30 |
| 5 | 300 | 4086 | 29120.50 |
| 6 | 165 | 4052 | 25246.40 |
| 7 | 330 | 3606 | 37392.00 |
| 8 | 380 | 1870 | 20505.15 |
| 9 | 210 | 1776 | 11588.40 |
| 10 | 270 | 1762 | 14779.80 |
| 11 | 200 | 1107 | 3520.70 |
| 12 | 250 | 949 | 7456.20 |
| 13 | 135 | 939 | 7110.60 |
| 14 | 160 | 786 | 2559.30 |
| 15 | 190 | 761 | 3344.70 |
| 16 | 90 | 757 | 2247.40 |
| 17 | 70 | 391 | 1636.80 |
| 18 | 220 | 385 | 1538.70 |
| 19 | 125 | 366 | 1310.40 |
| 20 | 180 | 356 | 1872.40 |
As we can see, the purchase counts for the first 5 sizes are in a league of their own compared to the remaining 16 pack sizes, with 175g and 150g clearly the most preferred options even among the top.
For emphasis, we'll visualize the top 5 pack sizes in comparison with the rest.
test = g.iloc[0:5,0:]
g = g.drop(range(0,5)).reset_index(drop=True)
test1= {"Pack" :['Top_5','others'],
"count":[test['count'].sum(),g['count'].sum()],
"sales":[test['sales'].sum(),g['sales'].sum()]
}
test1=pd.DataFrame(test1)
ax=test1['count'].plot.bar(color='purple')
# Setting the x-axis labels
ax.set_xticklabels(test1['Pack'])
plt.title('Count of Top_5 vs Others')
plt.xlabel('Pack')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
There is a strikingly large gap between the top 5 pack sizes and all the others.
We'll now dig deeper to see how this affects total sales.
fig, ax = plt.subplots()
ax.pie(test1['sales'], labels=test1['Pack'], autopct='%1.1f%%', startangle=90, colors=['lightblue', 'grey'], wedgeprops=dict(width=0.6))
plt.title('Distribution of Sales for Top_5 vs Others')
plt.show()
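The split shown in the pie chart can be reproduced numerically from the per-size totals in the pack-size table above:

```python
# Totals derived from the pack-size table above.
top5_sales = 334846.90    # sum of sales for the five most-purchased pack sizes
others_sales = 171229.45  # sum of sales for the remaining sixteen sizes

# Share of total sales captured by the top 5 sizes.
share = top5_sales / (top5_sales + others_sales) * 100
print(round(share, 1))  # roughly two-thirds of all sales
```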
As expected, the combined sales of just the five preferred pack sizes exceed those of all the other sizes put together, with the top 5 accounting for well over half of total sales.
Next, we'll examine the top 5 choices themselves to understand their distribution.
ax=test['count'].plot.bar()
# Setting the x-axis labels
ax.set_xticklabels(test['PACK_SIZE(G)'])
plt.title('Distribution of Top Choices')
plt.xlabel('Pack')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
Looking at the chart, there is no correlation between pack size and why customers prefer one size over another. This may be due to factors not captured in our dataset, which further data collection could address, or customers may simply decide based on qualitative factors such as hand-fit or feel.
We also see that customers seem to prefer the 175g size the most.
# Let's see how this affects our sales
ax=test['sales'].plot.bar(color='pink')
ax.set_xticklabels(test['PACK_SIZE(G)'])
plt.title('Distribution of Top Sales')
plt.xlabel('Pack')
plt.ylabel('Sale')
Text(0, 0.5, 'Sale')
The distribution of total sales closely mirrors the distribution of purchase counts, so the insights are the same.
Next, we'll focus on customer attributes to find out which lifestage contributes the most to total sales.
#grouping our data by LIFESTAGE and finding sum of each attribute.
sales.groupby('LIFESTAGE').agg(TSALE=('TOT_SALES','sum')).sort_values('TSALE',ascending=False)
| LIFESTAGE | TSALE |
|---|---|
| RETIREES | 103143.50 |
| OLDER SINGLES/COUPLES | 103122.70 |
| YOUNG SINGLES/COUPLES | 98539.70 |
| OLDER FAMILIES | 68468.50 |
| YOUNG FAMILIES | 64320.00 |
| MIDAGE SINGLES/COUPLES | 50912.70 |
| NEW FAMILIES | 17569.25 |
We see that the older section of our population contributes the most to total sales, with the exception of young singles/couples. We assume this is because these lifestages have fewer dependents than the remaining groups, leaving more disposable income.
# Visualizing our findings, highlighting the top three lifestages in blue
color_map = {
    'YOUNG SINGLES/COUPLES': 'blue',
    'YOUNG FAMILIES': 'grey',
    'OLDER SINGLES/COUPLES': 'blue',
    'MIDAGE SINGLES/COUPLES': 'grey',
    'NEW FAMILIES': 'grey',
    'OLDER FAMILIES': 'grey',
    'RETIREES': 'blue'
}
# Create the histogram (summing TOT_SALES per lifestage); avoid shadowing
# the builtin map and matplotlib's plt alias
fig = px.histogram(data, x='LIFESTAGE', y='TOT_SALES', title='Sales Distribution by Lifestage Category', color='LIFESTAGE', color_discrete_map=color_map)
# Display the chart
fig.show()
We clearly see the three major lifestage contributors to total sales, indicating their significant purchasing power and influence.
Next we'll focus on customer segments to do a similar analysis.
#grouping our data by customer segments and finding sum of each group.
s=sales.groupby('PREMIUM_CUSTOMER').agg(Group=("PREMIUM_CUSTOMER",'first'),TSALE=('TOT_SALES','sum')).sort_values('TSALE',ascending=False)
Premium customers spend more on average than the other groups, but they are relatively rare. Mainstream customers contribute the most to total sales simply because they form the bulk of our dataset, while budget customers, though they spend less individually, make up for it with their larger numbers.
fig = px.pie(s, names='Group', values='TSALE', title='Sales Segmentation')
fig.show()
To find which segments are likely to spend above the average total sale, let's first check what that average is.
sales['TOT_SALES'].describe()
count    72637.000000
mean         6.967198
std          2.917684
min          1.500000
25%          4.600000
50%          7.400000
75%          8.800000
max         29.500000
Name: TOT_SALES, dtype: float64
The statistical summary of the total sales column shows that the average customer spends about $6.97, with a minimum of $1.50 and a maximum of $29.50, although most values fall between $4.60 and $8.80.
To confirm this, let's count the transactions with total sales greater than $9.00.
d=sales.TOT_SALES[sales['TOT_SALES']>9.0].count()
d
16965
print(f"Sales greater than 9.0 is only about {round(((d/72637) * 100),2)} % of our data.")
Sales greater than 9.0 is only about 23.36 % of our data.
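The same question can be asked per segment: what fraction of each group's transactions exceed $9.00? A minimal sketch on a hypothetical mini frame (the pattern generalises to the full `sales` data):

```python
import pandas as pd

# Hypothetical mini frame standing in for the full dataset.
mini = pd.DataFrame({"PREMIUM_CUSTOMER": ["Mainstream", "Mainstream",
                                          "Budget", "Premium"],
                     "TOT_SALES": [9.5, 4.0, 10.0, 6.0]})

# Fraction of each segment's transactions above $9.00:
# the boolean column's mean is exactly that fraction.
above = (mini.assign(above=mini["TOT_SALES"] > 9.0)
             .groupby("PREMIUM_CUSTOMER")["above"].mean())
print(above.to_dict())  # {'Budget': 1.0, 'Mainstream': 0.5, 'Premium': 0.0}
```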
# Let's look at the average sale of each customer group.
s=sales.groupby('PREMIUM_CUSTOMER').agg(AVG_SALE=('TOT_SALES','mean')).sort_values('AVG_SALE',ascending=False)
s[['AVG_SALE']]
| PREMIUM_CUSTOMER | AVG_SALE |
|---|---|
| Mainstream | 7.072459 |
| Premium | 6.902973 |
| Budget | 6.891060 |
We see that the differences between each group's average sale and the overall average are too small to be considered significant.
Our analysis started by identifying which customer lifestages contribute most to total sales. We found that retirees and older singles/couples are the primary drivers of revenue, followed closely by young singles/couples; this trend likely arises from these groups having fewer dependents than others. We then shifted our focus to the premium segments and found that mainstream customers stand out due to their higher numbers, which boosts their overall contribution to total sales.
We went further in our analysis by grouping each premium segment according to lifestage, using an explanatory visualization to support the next insight.
Notably, retirees and older singles/couples emerge as key attributes across all segments, explaining their substantial impact on sales figures. Additionally, mainstream customers tend to have a significantly larger representation of young singles/couples enabling us to understand the key demographics in our customer segmentation.
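That lifestage-by-segment grouping can be sketched with `pivot_table`; the rows below are hypothetical, and on the real data every PREMIUM_CUSTOMER segment would be cross-tabulated against every LIFESTAGE:

```python
import pandas as pd

# Hypothetical rows standing in for the merged dataset.
mini = pd.DataFrame({
    "PREMIUM_CUSTOMER": ["Mainstream", "Mainstream", "Budget"],
    "LIFESTAGE": ["RETIREES", "YOUNG SINGLES/COUPLES", "RETIREES"],
    "TOT_SALES": [7.0, 3.0, 5.0],
})

# Total sales per lifestage within each premium segment;
# fill_value=0 covers combinations with no transactions.
pivot = mini.pivot_table(index="LIFESTAGE", columns="PREMIUM_CUSTOMER",
                         values="TOT_SALES", aggfunc="sum", fill_value=0)
print(pivot.loc["RETIREES", "Mainstream"])  # 7.0
```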
In summary, our analysis highlights the importance of targeting key customer segments and focusing on popular pack sizes, such as 175g and 150g, to maximize sales and optimize marketing efforts. By aligning product offerings with customer preferences and effectively managing inventory, we can enhance overall sales performance.